It is computationally infeasible to consider every possible cutpoint for every possible predictor over every possible sequence of splits.
We use high-dimensional rectangles (boxes) instead of arbitrary shapes for ease of interpretation.
So, we use recursive binary splitting, which is:
top-down
greedy
Top-down
We begin at the top of the tree, where all observations belong to a single region, and successively split from there; hence the name top-down.
Greedy
Make the best split at a particular step; we do not look ahead. A predictor and a cutpoint \(s\) are chosen based on which split leads to the lowest RSS.
This process is then carried out recursively within each resulting region.
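The greedy criterion can be sketched in base R for a single numeric predictor. This is a toy illustration, not the {tree} package's internals: every candidate cutpoint \(s\) is scored by the summed RSS of the two regions it creates, and the minimizer is kept.

```r
# Toy illustration of one greedy split: for a single numeric predictor x,
# score every midpoint between adjacent sorted values as a cutpoint s,
# and keep the s that minimizes RSS(left region) + RSS(right region).
best_split <- function(x, y) {
  u <- sort(unique(x))
  cuts <- (u[-1] + u[-length(u)]) / 2          # candidate cutpoints
  rss <- vapply(cuts, function(s) {
    left  <- y[x < s]
    right <- y[x >= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  }, numeric(1))
  list(cutpoint = cuts[which.min(rss)], rss = min(rss))
}

# A noiseless step function: the best cut falls between the two levels
x <- 1:20
y <- rep(c(0, 5), each = 10)
best_split(x, y)  # cutpoint 10.5, rss 0
```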
Regression tree:
tree(formula = body_mass_g ~ ., data = penguins)
Variables actually used in tree construction:
[1] "species" "sex"
Number of terminal nodes: 4
Residual mean deviance: 97680 = 32140000 / 329
Distribution of residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-760.30 -219.20 15.16 0.00 220.30 815.20
Use recursive binary splitting to grow a large tree.
Apply cost-complexity pruning to obtain a sequence of best subtrees, as a function of \(\alpha\).
Use K-fold CV to choose \(\alpha\):
Repeat steps 1 and 2 on all but the kth fold of the training data, for each fold k.
Evaluate the MSE on the left-out kth fold, as a function of \(\alpha\). Average the results for each value of \(\alpha\), and choose the \(\alpha\) that minimizes the average error.
Return the subtree from step 2 that corresponds to the chosen value of \(\alpha\).
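A hedged sketch of these steps in R, continuing the penguins example from above (it assumes the {tree} and {palmerpenguins} packages are installed; the seed and `K` value are arbitrary choices):

```r
library(tree)
library(palmerpenguins)

# Step 1: grow a large tree on the complete cases
peng <- na.omit(penguins)
big_tree <- tree(body_mass_g ~ ., data = peng)

# Steps 2-3: cv.tree() runs K-fold CV over the cost-complexity sequence;
# $k holds the alpha values, $size the subtree sizes, $dev the CV deviance
set.seed(2024)
cv_out <- cv.tree(big_tree, K = 10)

# Step 4: keep the subtree whose size minimizes the CV deviance
best_size <- cv_out$size[which.min(cv_out$dev)]
final_tree <- prune.tree(big_tree, best = best_size)
summary(final_tree)
```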
For the Boston data set from {ISLR2}, fit a decision tree model that estimates the median value of owner-occupied homes in $1000s. Complete the full algorithm: grow a large tree and choose an \(\alpha\) for the best possible subtree.
Classification Trees
Response is qualitative
For making predictions, instead of using the mean of the training observations in a region, we use the most commonly occurring class of the training observations in that region.
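In other words, the fitted value for a region is the modal class of its training observations; a minimal base-R sketch (the function name is made up for illustration):

```r
# Most commonly occurring class among the training labels in a region
majority_class <- function(labels) {
  names(which.max(table(labels)))
}

majority_class(c("Yes", "No", "Yes", "Yes", "No"))  # "Yes"
```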
Classification trees are grown in a similar fashion to regression trees. But instead of RSS, we use the classification error rate.
Classification Error Rate (E): The fraction of training observations in a region that do not belong to the most common class.
Classification Error Rate
\[E = 1 - \max_k(\hat{p}_{mk})\]
Here \(\hat{p}_{mk}\) is the proportion of training observations in the \(m\)th region that are from the \(k\)th class.
But E is not sufficiently sensitive for tree growing, so in practice two other measures are used: the Gini index
\[G = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk})\]
and the entropy (deviance)
\[D = - \sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}\]
Both G and D take values near zero when the \(\hat{p}_{mk}\) are all near zero or one, i.e. when the node is nearly pure.
Classification Error Rate - Issues
tibble::tibble(
  prop_a = seq(0, 1, by = 0.001),
  prop_b = 1 - prop_a,
  max = ifelse(prop_a >= prop_b, prop_a, prop_b),
  E = 1 - max,
  G = 2 * (prop_a * prop_b),
  En = -1 * ((prop_a * log(prop_a)) + (prop_b * log(prop_b)))
) |>
  ggplot2::ggplot(ggplot2::aes(prop_a, E)) +
  ggplot2::geom_line() +
  ggplot2::geom_line(ggplot2::aes(y = G), colour = "green") +
  ggplot2::geom_line(ggplot2::aes(y = En), colour = "steelblue") +
  ggplot2::theme_minimal() +
  ggplot2::labs(
    x = "probability of Class A at a given node",
    y = "Y",
    title = "Comparing Classification Error Rate, Gini and Entropy",
    subtitle = "Toy example with two classes: A and B"
  )
A comment on E, G and D
Best to use G or D for splitting
For pruning and gauging overall accuracy, E is preferable if prediction accuracy is the goal.
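Why E is too blunt for growing can be seen in a toy two-class computation (the counts below are assumed for illustration, not taken from the text): two candidate splits of the same node can have identical weighted error rates while the Gini index rewards the split that produces a pure child node.

```r
# Toy two-class example: two splits of an 800-observation node
# (400 per class) with the same weighted error rate E, where the
# Gini index G still prefers the split that creates a pure node.
err  <- function(p) 1 - max(p)          # classification error rate E
gini <- function(p) sum(p * (1 - p))    # Gini index G

# Split 1: children with class counts (300, 100) and (100, 300)
w_err1  <- 0.5 * err(c(0.75, 0.25))  + 0.5 * err(c(0.25, 0.75))   # 0.25
w_gini1 <- 0.5 * gini(c(0.75, 0.25)) + 0.5 * gini(c(0.25, 0.75))  # 0.375

# Split 2: children with class counts (200, 400) and (200, 0) -- pure node
w_err2  <- 0.75 * err(c(1/3, 2/3))  + 0.25 * err(c(1, 0))   # also 0.25
w_gini2 <- 0.75 * gini(c(1/3, 2/3)) + 0.25 * gini(c(1, 0))  # ~0.333, lower

c(w_err1, w_err2, w_gini1, w_gini2)
```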
The {tree} package takes care of all this by default.
Code reference
tree(league ~ ., data = Hitters) -> mod
prune.tree(mod, method = "misclass")
cv.tree(mod, FUN = prune.misclass)
Do it yourself
This link has the Heart data for patients with chest pains. The variable AHD is a binary variable where Yes refers to existing heart disease. Create a decision tree model, a full-grown tree, to understand the factors for predicting AHD. Also, prune this tree by choosing the appropriate \(\alpha\). Write an interpretation note for your model. Do your final and full-grown trees have some splits that estimate the same response for their terminal nodes? Why does that happen?